Memory biased random walk approach to synthetic clickstream generation



Nino Antulov-Fantulin 
Division of Electronics, Ruder 
Boskovic Institute, Zagreb, 
Croatia 
nino.antulov-fantulin@irb.hr



Matko Bosnjak 
Department of Informatics 

Engineering, Faculty of 
Engineering, University of 
Porto, Portugal 
Division of Electronics, Ruder 
Boskovic Institute, Zagreb, 
Croatia 

matko.bosnjak@gmail.com



Vinko Zlatic 

Theoretical Physics Division, 
Rudjer Boskovic Institute, 

Zagreb, Croatia 
ISC-CNR, Dipartimento di 
Fisica, Universita "La 
Sapienza", Roma, Italy 
vinko.zlatic@irb.hr 



Miha Grcar 
Department of Knowledge 
Technologies / E8, Jozef 
Stefan Institute, Ljubljana, 
Slovenia 
miha.grcar@ijs.si



Tomislav Smuc
Division of Electronics, Ruder
Boskovic Institute, Zagreb,
Croatia
tomislav.smuc@irb.hr



ABSTRACT 

Personalized recommender systems rely on the personal usage data of
each user in the system. However, privacy policies protecting users'
rights prevent these data from being made publicly available to a wider
research audience. In this work, we propose a memory biased random walk
model (MBRW) based on real clickstream graphs, as a generator of
synthetic clickstreams that conform to the statistical properties of the
real clickstream data while, at the same time, adhering to privacy
protection policies. We show that synthetic clickstreams can be used to
learn recommender system models that achieve high recommender
performance on real data, while at the same time providing strong
guarantees against de-anonymization.

Categories and Subject Descriptors 

K.4.1 [Computers and society]: Public Policy Issues — 
Privacy; H.2.8 [Database management]: Database ap- 
plications — Data mining; G.3 [Mathematics of Comput- 
ing]: Probability and statistics — Markov processes 
1. INTRODUCTION

Advances in information technology constantly increase the quantity of
information produced, transmitted and stored every second around the
world. This drives researchers to constantly improve the capability and
performance of data mining algorithms. Modern data mining algorithms
can also easily extract unique patterns from data and thus pose a
serious threat to privacy protection policies.

Recommender systems have proved to be a very useful personal decision
support tool for searching through vast amounts of information on a
subject of interest. Recommender techniques [1], [2] are divided into
three main categories: content-based, collaborative and hybrid
techniques. Content-based techniques [4] recommend items with respect
to the available structured or unstructured knowledge about the user's
online history; collaborative filtering techniques [5] recommend items
with respect to the past behaviour and interactions of all users and all
items, whereas hybrid techniques [6] combine the previous two
approaches.

Regardless of the technique used, the system must have access to the
users' private data: data consisting of the users' history on the
website and possibly other personal information. The quality of the
data, meaning its level of detail, dictates the efficacy of the system.
To a first approximation, a clickstream is a sequence of clicks made by
a particular user while browsing the web, leading from one item of
interest to another. Clickstream data are considered personal
information; privacy policies therefore heavily restrict the public
availability of these data, often rendering them unavailable for
research.

To comply with these privacy policies, many real data sets (the Netflix
dataset, AOL search logs, the Massachusetts Group Insurance Commission
(GIC) medical encounter dataset, etc.) were anonymized by removing all
explicit personal identification attributes such as name, social
security number, etc. Nevertheless, successful real-world "linkage
attacks" on these data sets have been carried out, in which researchers
managed to identify personal records by linking different datasets
through quasi-identifiers such as search logs, movie ratings, gender,
ZIP code, etc. In the case of the Netflix challenge dataset, all
explicit personal identification attributes were replaced with a unique
random number, but researchers [7] still managed to link the Netflix
dataset with the IMDB dataset by the dates of users' ratings and to
partly de-anonymize the Netflix dataset. The dataset published by the
Massachusetts Group Insurance Commission (GIC) was also the target of a
linkage attack by researchers who exploited the fact that it is uncommon
for two persons to have the same ZIP code, gender and birth date.
Sweeney showed [8] that 87% of the U.S. population is unique with
respect to quasi-identifiers such as ZIP code, gender and date of birth.
In August 2006, AOL published a dataset of around 20 million search
queries of 650,000 users over a three-month period, which contained
users' clickstreams (URLs) from search results. Only a few days later,
New York Times reporters linked one clickstream to a real user, which
triggered a lawsuit against AOL in the U.S. District Court in September
2006.

The paper is organized as follows. In Section 2 we present related work
on the problem of privacy-preserving data publishing. In Section 3 we
describe the data generator matrices used for the generation of
synthetic clickstreams, and the biased and unbiased random walk models
on graphs. In Sections 4 and 5 we build on the previous section and
apply a biased random walk model to clickstream data, focusing on memory
biases and on the user-profile sampling procedure used to construct
synthetic clickstream datasets. In Section 6 we present experiments
performed to show the applicability of our method in recommender
systems. We also provide an additional anonymization guarantee for
synthetic clickstream datasets.

2. RELATED WORK 

The problems of privacy-preserving data publishing [27], [28] and
privacy-preserving data mining [29] are intensively researched within
three research communities: the database community, the statistical
disclosure community and the cryptography community. Different privacy
protection models have been proposed in order to counter possible
privacy attacks such as record linkage, attribute linkage, table linkage
and probabilistic attacks.

Record linkage models such as the k-anonymity model [30], [31], [32]
ensure that the number of records sharing any quasi-identifier value is
at least k, and therefore that the linkage probability is at most 1/k.
Attribute linkage models such as l-diversity [33], [34] are designed to
overcome the problem of inferring sensitive values from k-anonymity
groups by decreasing the correlations between quasi-identifiers and
sensitive values. The l-diversity model ensures that the entropy of the
sensitive attributes in each group is larger than some threshold value
l. A high value of l implies a smaller probability of inferring
sensitive values. Probabilistic models such as the ε-differential
privacy model [35] ensure that the difference between the prior and
posterior beliefs is small enough: an individual's presence or absence
in the database does not affect the query output significantly.
Post-random perturbation (PRAM) methods [36], [37] change original
values through probabilistic mechanisms and thus, by introducing
uncertainty into the data, reduce the risk of re-identification.
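The k-anonymity condition above can be checked mechanically. The following is a minimal sketch, assuming records are stored as dictionaries; the function name `k_anonymity` and the toy table are hypothetical and serve only as an illustration.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records that share the same
    quasi-identifier values; the table is k-anonymous for any k up to it."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical toy table (illustrative values only, not real data).
records = [
    {"zip": "10000", "gender": "F", "birth_year": 1980, "diagnosis": "A"},
    {"zip": "10000", "gender": "F", "birth_year": 1980, "diagnosis": "B"},
    {"zip": "10000", "gender": "M", "birth_year": 1975, "diagnosis": "A"},
    {"zip": "10000", "gender": "M", "birth_year": 1975, "diagnosis": "C"},
]

k = k_anonymity(records, ["zip", "gender", "birth_year"])
linkage_bound = 1.0 / k  # upper bound on the record linkage probability
print(k, linkage_bound)
```

In this toy table every quasi-identifier combination occurs twice, so an attacker who knows the quasi-identifiers of a person can narrow the match down to two records at best.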

Synthetic data generation is an alternative approach to data
protection, in which a model generates a synthetic dataset that
preserves the statistical properties of the original dataset. Several
approaches to synthetic data generation have been proposed: (i)
synthetic data generation by the multiple imputation method [38]; (ii)
synthetic data by the bootstrap method [39] (estimate the multivariate
cumulative distribution function, derive a similar c.d.f. and sample a
synthetic dataset from it); (iii) synthetic data by Latin Hypercube
Sampling (multivariate synthetic dataset); and (iv) others, such as a
combination of partially synthetic attributes and real non-confidential
attributes [41], [42].

Most of the aforementioned anonymization strategies were developed for
database records with a fixed number of attributes, not for sequences
such as clickstream data. This led us to propose a method for synthetic
clickstream generation based on random walks. Random walks [9], [10],
[11], [12] have previously been used to construct recommender systems
for different types of graph structures emanating from users' private
data, but not to generate synthetic clickstreams. We propose an approach
to synthetic clickstream generation that constructs a memory biased
random walk model (MBRW) on the graph of the clickstream sequences,
which is a subclass of Markov chains [13], [14]. We use the MBRW model
to generate synthetic clickstreams with statistical properties similar
to those of the real clickstreams. The MBRW algorithm can be understood
as both a clickstream generation process and an anonymization process.
Re-identification is harder and uncertain because the synthetic
clickstreams are the result of a discrete stochastic process.
Furthermore, we can provide an even stronger privacy protection
guarantee with respect to the (θ, ω) de-anonymization definition.

Many data-privacy researchers state that high-dimensional data resist
de-anonymization poorly [7], which is a serious problem for companies
and prevents the use of real-life datasets for research and for data
mining challenges. Currently, the Overstock.com recommender challenge
[44], worth 1 million dollars, is running; for it, synthetic data that
share certain statistical properties with real data sets were released.
The organizers state that, instead of releasing sensitive data, they can
bring the recommender code to the data in the cloud. In the final round
of the challenge, the contestants' code will be uploaded to the RecLabs
[45] core server to build models and to evaluate them against the real
data. The organizers only claim that their synthetic data may share
certain real statistical properties and should be used just to test
whether the code works. It would be useful for both the contestants and
the company if synthetic data could also be used as an early indicator
of model performance. In the rest of the paper we demonstrate that
synthetic data generated with our method are indeed a good testing
ground for recommender systems.

3. RANDOM WALK MODELS 
Data generator matrices 

A clickstream is a sequence of web pages, or items in general, visited
by a user. More precisely, the clickstream of a particular user $i$ is
an ordered sequence of web pages $c^i = (u^i_1, u^i_2, u^i_3, \ldots, u^i_l)$
visited by that user. If the web page $u^i_j$ was visited before the web
page $u^i_k$ by user $i$, then $j <_i k$. The "less than" relation $<_i$
depends on the particular user $i$. The set of all the clickstreams in a
system is $C = \{c^1, c^2, \ldots, c^i, \ldots, c^n\}$.

We define two characteristic matrices for the set of clickstreams $C$:
(i) the Direct Sequence matrix ($DS$) and (ii) the Common View Score
matrix ($CVS$). The matrix element $DS[m, n]$ is the number of
clickstreams in $C$ in which the web page $m$ immediately follows the
web page $n$. The matrix element $CVS[m, n]$ is the number of
clickstreams in $C$ to which both the web page $m$ and the web page $n$
belong. Therefore, $DS$ is in general not a symmetric matrix, whereas
$CVS$ is. The matrix $DS$ is the adjacency matrix of a weighted directed
graph $(V, E_{DS})$ whose set of vertices $V$ represents the web pages
or items in the particular system and whose weighted edges $E_{DS}$ are
given by the $DS$ matrix. The matrix $CVS$ is the adjacency matrix of a
weighted undirected graph $(V, E_{CVS})$ on the same set of vertices
$V$, whose weighted edges $E_{CVS}$ are given by the $CVS$ matrix.
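The two definitions translate directly into a counting procedure. The sketch below is one possible reading, assuming items are encoded as integers 0..n_items-1 and that each clickstream contributes at most once to any matrix element; the function name `build_matrices` and the toy clickstreams are hypothetical.

```python
import numpy as np
from itertools import combinations

def build_matrices(clickstreams, n_items):
    """Return (DS, CVS) with DS[m, n] = number of clickstreams in which item m
    immediately follows item n, and CVS[m, n] = number of clickstreams that
    contain both m and n."""
    DS = np.zeros((n_items, n_items), dtype=int)
    CVS = np.zeros((n_items, n_items), dtype=int)
    for c in clickstreams:
        # set(...) makes a clickstream count once per ordered pair (prev -> nxt)
        for prev, nxt in set(zip(c[:-1], c[1:])):
            DS[nxt, prev] += 1
        # unordered pairs of distinct items seen in the same clickstream
        for a, b in combinations(sorted(set(c)), 2):
            CVS[a, b] += 1
            CVS[b, a] += 1
    return DS, CVS

# Hypothetical toy clickstreams over 4 items.
clickstreams = [[0, 1, 2], [0, 1, 3], [2, 1, 0]]
DS, CVS = build_matrices(clickstreams, n_items=4)
print(DS)
print(CVS)
```

Note the index order: the column is the page visited first and the row is the page visited next, which matches the normalisation over the first index used in the random walk formulas.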

Unbiased random walk 

In this subsection, we consider a random walk on the large connected
component of a non-weighted, undirected graph, obtained by transforming
the weighted $CVS$ graph into a non-weighted graph. The random walk
starts at an arbitrary vertex $v_k$ of the $CVS$ graph. At each discrete
time step $n$, the random walker resides at some vertex $v_m$ and
chooses an adjacent vertex $v_i$ uniformly at random among the vertices
adjacent to $v_m$. Let $p_i(n)$ denote the probability that the random
walker is at vertex $v_i$ at discrete time $n$:

$$ p_i(n) = \sum_j \frac{CVS_{ij}}{k_j} \, p_j(n-1) \tag{1} $$

in which $k_j$ denotes the degree of vertex $v_j$. In matrix form we can
write $p(n) = CVS \, D^{-1} p(n-1) = T p(n-1)$, in which $D$ is a
diagonal matrix with the vertex degrees $k_i$ on the diagonal and $T$ is
the transition matrix. The stationary distribution $p(\infty)$, or
simply $p$, satisfies:

$$ p = CVS \, D^{-1} p \;\Rightarrow\; (I - CVS \, D^{-1})\, p = (D - CVS)\, D^{-1} p = L D^{-1} p = 0 $$

in which the matrix $L = D - CVS$ is the Laplacian matrix of the graph
$CVS$ and $D^{-1} p$ is its eigenvector with the corresponding zero
eigenvalue. Thus, we know that $D^{-1} p = c\mathbf{1}$ for every
connected graph, in which $c$ is a constant and $\mathbf{1}$ denotes the
vector whose components are all ones. This implies that the stationary
probabilities are:

$$ p_i = c\, k_i = \frac{k_i}{\sum_j k_j} $$

Therefore, we conclude that the stationary probability of a random
walker being at a vertex $v_i$ is proportional to its degree $k_i$ [15].
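The degree-proportional stationary distribution is easy to verify numerically. The following sketch iterates $p(n) = T p(n-1)$ on a small hypothetical graph (a triangle with one pendant vertex, chosen only for illustration) and compares the result with $k_i / \sum_j k_j$.

```python
import numpy as np

# Hypothetical non-weighted, undirected, connected graph:
# a triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

k = A.sum(axis=0)           # vertex degrees k_j
T = A / k                   # T[i, j] = A[i, j] / k[j], column-stochastic

p = np.full(len(k), 1.0 / len(k))   # arbitrary initial distribution
for _ in range(200):                # iterate p(n) = T p(n-1)
    p = T @ p

print(p)             # empirical stationary distribution
print(k / k.sum())   # degree-proportional prediction
```

The triangle makes the graph non-bipartite, so the iteration converges; on a bipartite graph the walk would oscillate and a lazy walk or an eigenvector solver would be needed instead.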



Biased random walk 

In this subsection we consider a biased random walk [16] on the large
connected component of the weighted $CVS$ graph, which is undirected.
The elements of the weighted adjacency matrix $CVS$ are used as biases
for the random walk. Let $p_i(n)$ denote the probability that the random
walker is at vertex $v_i$ at discrete time $n$. This probability can be
expressed as:

$$ p_i(n) = \sum_j \frac{CVS_{ij}}{\sum_k CVS_{kj}} \, p_j(n-1) $$



In matrix form we can write $p(n) = CVS \, Z^{-1} p(n-1) = T p(n-1)$, in
which $Z$ is a diagonal matrix with the elements $\sum_k CVS_{ki}$ on
the diagonal and $T$ is the transition matrix. By analogy with the
unbiased case, the stationary distribution $p$ satisfies:

$$ p = CVS \, Z^{-1} p $$
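The stationary condition can be solved explicitly with the same argument as in the unbiased case. The short derivation below is a sketch that relies only on the symmetry of $CVS$; the symbol $s_i$ (the weighted degree, or strength, of vertex $v_i$) is introduced here for convenience and is not part of the notation used elsewhere in the text.

```latex
% s_i is the i-th diagonal element of Z
s_i = \sum_k CVS_{ki}, \qquad Z = \mathrm{diag}(s_1, s_2, \ldots)

% Try p = c\,s, so that Z^{-1} p = c\,\mathbf{1}. Because CVS is symmetric,
% its row sums equal its column sums, hence
CVS \, Z^{-1} p = c \, CVS \, \mathbf{1} = c\, s = p

% Normalising the probabilities to sum to one gives
p_i = \frac{s_i}{\sum_j s_j}
```

In other words, under these assumptions the biased walker visits a vertex with a stationary probability proportional to the total weight of its edges, which reduces to the degree-proportional result when all weights are equal to one.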



4. MBRW MODEL 

The random walk of this model takes place on the $DS$ graph and uses a
combination of biases from the $DS$ and $CVS$ matrices. This
discrete-time Markov chain model has a finite memory of $m$ past states.
The initial vertex can be chosen by either a stochastic or a
deterministic rule. Let us denote the initial vertex by $u_1$; the model
then chooses the next adjacent vertex $u_2$ with probability:

$$ P(u_2 \mid u_1) = \frac{DS_{u_2,u_1}}{\sum_k DS_{k,u_1}} $$

thus generating a clickstream $c = \{u_1, u_2\}$. The third item $u_3$
in the clickstream is chosen with probability:

$$ P(u_3 \mid u_2, u_1) = \frac{DS_{u_3,u_2}\, CVS_{u_3,u_1}}{\sum_k DS_{k,u_2}\, CVS_{k,u_1}} $$

thus generating a clickstream $c = \{u_1, u_2, u_3\}$. Using a finite
memory of size $m$, we choose the vertex $u_n$ with probability:

$$ P(u_n \mid u_{n-1}, \ldots, u_{n-m-1}) = \frac{DS_{u_n,u_{n-1}} \prod_{k=1}^{m} CVS_{u_n,u_{n-k-1}}}{\sum_j DS_{j,u_{n-1}} \prod_{k=1}^{m} CVS_{j,u_{n-k-1}}} $$

thus generating a clickstream $c = \{u_1, u_2, u_3, \ldots, u_n\}$ at
the $n$-th step of the random walk. The clickstream length $l$ is a
random variable sampled from a discrete probability distribution such as
the Poisson, negative binomial or geometric distribution, or from the
real clickstream length distribution, if available. Using this model, we
generate a set of clickstreams
$\tilde{C} = \{\tilde{c}^1, \tilde{c}^2, \ldots, \tilde{c}^K\}$. In each
of the $K$ independent iterations, we determine the clickstream length
$l$ and the initial vertex of the random walk. At the end of each
iteration $i$, the random walk path
$\tilde{c}^i = \{u^i_1, u^i_2, \ldots, u^i_l\}$ determines one
clickstream, which is appended to the synthetic clickstream set
$\tilde{C}$.
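The transition rule can be sketched in a few lines. The code below is an illustration under stated assumptions, not a reference implementation: matrices are indexed as `DS[next, previous]`, the memory is truncated when fewer than $m$ earlier items exist, and a walk simply stops at a dead end. The function names and the toy matrices are hypothetical.

```python
import numpy as np

def mbrw_step_probs(DS, CVS, history, m):
    """Distribution over the next item given the walk so far.

    The last item biases the step through DS; the (up to) m items before
    it bias the step through CVS. Returns None if no item is reachable.
    """
    w = DS[:, history[-1]].astype(float)
    for u in history[-(m + 1):-1]:      # the m items preceding the last one
        w = w * CVS[:, u]
    total = w.sum()
    return w / total if total > 0 else None

def mbrw_walk(DS, CVS, start, length, m, rng):
    """Generate one clickstream of at most `length` items."""
    walk = [start]
    while len(walk) < length:
        p = mbrw_step_probs(DS, CVS, walk, m)
        if p is None:                   # dead end: stop early
            break
        walk.append(int(rng.choice(len(p), p=p)))
    return walk

# Hypothetical toy matrices over 3 items (DS[next, previous]).
DS = np.array([[0, 1, 2], [3, 0, 1], [1, 2, 0]])
CVS = np.array([[0, 2, 1], [2, 0, 3], [1, 3, 0]])

p_no_memory = mbrw_step_probs(DS, CVS, [0, 1], m=0)   # DS bias only
p_memory = mbrw_step_probs(DS, CVS, [0, 1], m=1)      # DS bias times CVS bias
walk = mbrw_walk(DS, CVS, start=0, length=6, m=1, rng=np.random.default_rng(0))
print(p_no_memory, p_memory, walk)
```

With `m=0` the step from item 1 follows the `DS` column alone; with `m=1` the earlier item 0 reweights the candidates through its `CVS` column, which in this toy case leaves item 2 as the only possible continuation.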

5. MBRW MODEL WITH THE USER-PROFILE 
SAMPLING PROCEDURE 

Let $n$ and $q$ denote the total number of users and items in a
particular system, respectively. The clickstream set of such a system is
denoted by $C = \{c^1, c^2, \ldots, c^n\}$, in which $c^i$ represents
the clickstream of a particular user $i$. We can construct a user-item
usage matrix $A$ from the clickstream set $C$ by discarding the ordering
of items in the clickstreams. The user-item matrix $A$ is an element of
the space $\mathbb{R}^{n \times q}$ and contains a non-zero element
$A(i, j)$ if and only if the clickstream of user $i$ contains the item
$j$. The user-item matrix can be represented as a bipartite graph $B$,
in which the first type of vertices are users and the second type of
vertices are items.
Figure 1: Diagram of the MBRW user-profile sampling procedure



The matrix $A^T A$ is the one-mode projection of the bipartite graph
onto the items and is closely related to the $CVS$ matrix. Therefore,
both the $DS$ and $CVS$ matrices represent item preference models, and
random walks on the direct sequence and common view score graphs can
only preserve item preferences.

In order to preserve user preferences with random walk models, we
introduce the user-profile sampling procedure as an input memory
parameter to the MBRW model. For each synthetic clickstream we want to
create, we first randomly sample one real clickstream $c^j$ from the
clickstream set $C$. Let us assume that the sampled real clickstream
$c^j$ consists of the items
$\{x_1, \ldots, x_{s-1}, u_s, \ldots, u_e, x_{e+1}, \ldots\}$. We then
create a profile $p^j = \{u_s, \ldots, u_e\}$ of user $j$ by taking a
random contiguous subsample of the clickstream between the start index
$s$ and the end index $e$, which are random variables. Figure 1 shows a
diagram of the user-profile sampling procedure. In the next section, we
explain in more detail how the indexes $s$ and $e$ are sampled. We then
use the created user-profile sequence $p^j = \{u_s, \ldots, u_e\}$ as
the memory component of the MBRW model. More precisely, we take the
user profile $p^j$ as the first part of the synthetic clickstream
$\tilde{c}^j = \{u_s, \ldots, u_e\}$ and choose the next item $u_{e+1}$
with probability:

$$ P(u_{e+1} \mid u_e, u_{e-1}, \ldots, u_s) = \frac{DS_{u_{e+1},u_e} \prod_{k=1}^{m} CVS_{u_{e+1},u_{e-k}}}{\sum_{z \in Z} DS_{z,u_e} \prod_{k=1}^{m} CVS_{z,u_{e-k}}} $$

in which $Z$ denotes the set of neighbours of item $u_e$ in the $DS$
graph and $m = e - s$ is the memory size. If the set $Z$ is empty, we
choose the next item $u_{e+1}$ with probability:

$$ P(u_{e+1} \mid u_e, u_{e-1}, \ldots, u_s) = \frac{CVS_{u_{e+1},u_e} \prod_{k=1}^{m} CVS_{u_{e+1},u_{e-k}}}{\sum_{z \in Z} CVS_{z,u_e} \prod_{k=1}^{m} CVS_{z,u_{e-k}}} $$

in which $Z$ denotes the set of neighbours of item $u_e$ in the $CVS$
graph. We then continue to use the MBRW model until we have generated
$l$ new synthetic items. Note that the value $l$ is also a random
variable, sampled from some probability distribution $L$. In the end, we
have generated a synthetic clickstream
$\tilde{c}^j = \{u_s, \ldots, u_e, u_{e+1}, \ldots, u_{e+l}\}$
consisting of the original profile items $u_s, \ldots, u_e$ followed by
$l$ new synthetic items. By using user-profile sampling with the MBRW
model, we generate a synthetic clickstream set $\tilde{C}$ that
satisfies both the user and the item preferences. The pseudocode for the
clickstream MBRW algorithm with the user-profile sampling procedure is
given in Algorithm 1.



Algorithm 1 Clickstream MBRW model with the user-profile sampling
procedure

Input:  $C = \{c^1, c^2, \ldots, c^n\}$ - real clickstream set,
        DS - Direct Sequence matrix,
        CVS - Common View Score matrix,
        K - number of synthetic clickstreams,
        G - anonymization constant,
        E, M, L - probability distributions
Output: $\tilde{C} = \{\tilde{c}^1, \ldots, \tilde{c}^K\}$ - synthetic clickstream set

$\tilde{C} = \emptyset$
for i = 1 : K do
    c = take random clickstream from C;
    // c = {x_1, x_2, ..., x_{s-1}, u_s, ..., u_e, x_{e+1}, x_{e+2}, ...}
    e = sample "end index" from p.d. E;
    m = sample "size of memory" from p.d. M;
    s = e - m;    // generate start index
    profile = subsample of c from s until e;
    c_i = profile;
    // c_i = {u_s, ..., u_e} - temporary clickstream
    l = sample number of hops from p.d. L;
    for j = 1 : l do
        Z = find neighbours of item u_{e+j-1} in DS graph;
        if Z not empty set then
            Choose the next item u_{e+j} with the probability of:

            $$ P(u_{e+j} \mid c_i) = \frac{DS_{u_{e+j},u_{e+j-1}} \prod_{k=1}^{m} CVS_{u_{e+j},u_{e+j-k-1}}}{\sum_{z \in Z} DS_{z,u_{e+j-1}} \prod_{k=1}^{m} CVS_{z,u_{e+j-k-1}}} $$

        else
            Z = find neighbours of item u_{e+j-1} in CVS graph;
            Choose the next item u_{e+j} with the probability of:

            $$ P(u_{e+j} \mid c_i) = \frac{CVS_{u_{e+j},u_{e+j-1}} \prod_{k=1}^{m} CVS_{u_{e+j},u_{e+j-k-1}}}{\sum_{z \in Z} CVS_{z,u_{e+j-1}} \prod_{k=1}^{m} CVS_{z,u_{e+j-k-1}}} $$

        end if
        if Z empty set then
            end clickstream c_i and break;
        end if
        c_i = c_i ∪ u_{e+j};    // append new item
    end for
    // c_i = {u_s, ..., u_e, u_{e+1}, ..., u_{e+l}}
    if ∃ c ∈ C : sim(c_i, c) > G then
        // de-anonymization criterion
        discard clickstream c_i and continue;
    else
        $\tilde{C} = \tilde{C} \cup c_i$;    // append new synthetic clickstream
    end if
end for
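For readers who prefer executable code, Algorithm 1 can be sketched as follows. Several details are assumptions made for this illustration and are not fixed by the pseudocode: indexes are 0-based, the memory size is clipped so that the start index is non-negative, sampled Gaussian values are rounded and clipped at zero, the second Gaussian parameter is treated as a standard deviation, and the similarity function `sim` is taken to be the Jaccard index of the item sets. All function names are hypothetical.

```python
import numpy as np

def jaccard(a, b):
    """Hypothetical similarity: Jaccard index of the two item sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def step_weights(first, CVS, walk, m):
    """first[:, last item] times the CVS columns of the m items before it."""
    w = first[:, walk[-1]].astype(float)
    for u in walk[-(m + 1):-1]:
        w = w * CVS[:, u]
    return w

def generate(C, DS, CVS, K, G, sample_e, sample_m, sample_l, rng, sim=jaccard):
    synthetic = []
    for _ in range(K):
        c = C[int(rng.integers(len(C)))]      # random real clickstream
        e = sample_e(len(c), rng)             # end index (0-based, inclusive)
        m = min(sample_m(rng), e)             # memory size, clipped so s >= 0
        s = e - m                             # start index
        ci = list(c[s:e + 1])                 # user profile u_s..u_e
        for _hop in range(sample_l(rng)):     # l random walk hops
            use_ds = DS[:, ci[-1]].any()      # is Z non-empty in the DS graph?
            w = step_weights(DS if use_ds else CVS, CVS, ci, m)
            if w.sum() == 0:                  # nowhere to go: end clickstream
                break
            ci.append(int(rng.choice(len(w), p=w / w.sum())))
        if any(sim(ci, real) > G for real in C):
            continue                          # de-anonymization criterion
        synthetic.append(ci)
    return synthetic

# Hypothetical toy data: 3 real clickstreams over 5 items.
C = [[0, 1, 2, 3], [1, 2, 3, 4], [0, 2, 4, 1]]
DS, CVS = np.zeros((5, 5)), np.zeros((5, 5))
for c in C:
    for prev, nxt in zip(c[:-1], c[1:]):
        DS[nxt, prev] += 1
    for a in c:
        for b in c:
            if a != b:
                CVS[a, b] += 1

def gauss(mu, sd):
    """Rounded, non-negative Gaussian sampler (an assumption of this sketch)."""
    return lambda rng: max(0, int(round(rng.normal(mu, sd))))

def uniform_end(n, rng):
    return int(rng.integers(n))

out = generate(C, DS, CVS, K=20, G=1.0, sample_e=uniform_end,
               sample_m=gauss(3, 2), sample_l=gauss(9, 2),
               rng=np.random.default_rng(0))
print(len(out), out[:3])
```

With `G=1.0` nothing is discarded, because the Jaccard index never exceeds one. Lowering `G` rejects synthetic clickstreams that overlap too strongly with any real clickstream, so fewer than `K` clickstreams may be returned.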



6. EXPERIMENTS AND ANALYSIS 

Here, we state five hypotheses about the properties of our clickstream
MBRW model with the user-profile sampling procedure:

1. Basic statistical properties of the $DS$ and $CVS$ matrices are
preserved in the synthetic clickstream set $\tilde{C}$ if we generate a
synthetic clickstream set of sufficiently large size.

2. The memory property of our model increases the probability of
choosing the relevant next item.

3. User preferences are largely preserved, even for very small values of
the memory parameter, owing to the user-profile sampling procedure,
although the random walk takes place on item-preference graphs.

4. Synthetic clickstream sets can be used by recommender systems to
learn a model on an anonymized dataset while achieving high
recommendation performance on real test data.

5. We can guarantee that no synthetic clickstream exists whose
similarity to some real clickstream is above some threshold G.

We now describe the experiments we performed in order to confirm our
hypotheses. In these experiments we used a sample dataset created from
the Yahoo! Music community's preferences for various musical items,
released for the KDD-Cup 2011 [43]. We downsampled the dataset and
created a sub-sample of modest dimensions (10000 users over 5000 items)
in order to reduce the computational load of the numerous experiments.
For each user, the set of rated items ordered by ascending time
represents an original clickstream in our experiments. We use "vertical"
and "horizontal" splits of the clickstream dataset to create training
and test datasets; different splits were necessary to test different
hypotheses. A "horizontal split" of the clickstream dataset $C$ creates
two disjoint clickstream sets $C_{train}$ and $C_{test}$ of fixed sizes.
A "vertical split" of the clickstream dataset $C$ creates two
clickstream sets by a temporal cut: all the items in a clickstream prior
to some specific time $t$ are put into the $C_{train}$ set and all the
items in a clickstream after time $t$ are put into the $C_{test}$ set.
We also apply both splits together, creating two disjoint sets of
clickstreams of fixed sizes and then making a vertical split on one of
them. For each experiment, we explain the logic of the particular split.
These splits are represented graphically in Figure 2.
Figure 2: Three ways of splitting the original clickstream set used in
the computational experiments: A - horizontal split, B - vertical split
and C - horizontal and vertical split
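The two kinds of split can be sketched as follows. This is a minimal illustration, assuming each clickstream is stored as a list of (timestamp, item) pairs in ascending time order and that an item clicked exactly at time $t$ goes to the test part; the function names and the toy data are hypothetical.

```python
import random

def horizontal_split(clickstreams, n_train, seed=0):
    """Two disjoint sets of whole clickstreams (i.e. of users)."""
    idx = list(range(len(clickstreams)))
    random.Random(seed).shuffle(idx)
    train = [clickstreams[i] for i in idx[:n_train]]
    test = [clickstreams[i] for i in idx[n_train:]]
    return train, test

def vertical_split(clickstreams, t):
    """Temporal cut of every clickstream at time t."""
    train = [[item for ts, item in c if ts < t] for c in clickstreams]
    test = [[item for ts, item in c if ts >= t] for c in clickstreams]
    return train, test

# Hypothetical toy clickstreams of (timestamp, item) pairs.
C = [[(1, "a"), (2, "b"), (5, "c")],
     [(3, "b"), (4, "d")],
     [(1, "c"), (6, "a")]]

h_train, h_test = horizontal_split(C, n_train=2)
v_train, v_test = vertical_split(C, t=4)
print(v_train)
print(v_test)
```

The combined split is obtained by calling `horizontal_split` first and then applying `vertical_split` to one of the two resulting sets.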



Matching basic statistical properties experiment 

In this experiment we examined how well the statistical properties of
the item preference matrices $DS$ and $CVS$ are preserved in the
synthetic clickstream set with respect to the original clickstream set.
We used a "horizontal split" of our clickstream dataset $C$ into two
sets $C_{train}$ and $C_{test}$ of 9000 and 1000 clickstreams,
respectively. Then, we calculated the $DS$ and $CVS$ matrices of the
$C_{train}$ dataset and created the synthetic clickstream set
$\tilde{C}$ by using the MBRW model with the user-profile sampling
procedure. The end index parameter $e$ was sampled from a uniform
distribution on the range from 1 to the length of the particular sampled
clickstream, the memory parameter $m$ was sampled from the Gaussian
distribution $\mathcal{G}(3, 2)$, the number of random walk hops $l$ was
sampled from the Gaussian distribution $\mathcal{G}(9, 2)$, and the
number of synthetic clickstreams $K$ varied from 10^ to 10^. Once we
have the synthetic clickstream set $\tilde{C}$, we calculate its
statistical properties $\widetilde{DS}$ and $\widetilde{CVS}$ and
compare them to the statistical properties $DS$ and $CVS$ of the
original dataset. It turns out that matrix norms such as the Frobenius
and max norms of the difference matrices $(DS - \widetilde{DS})$ and
$(CVS - \widetilde{CVS})$ are not appropriate, as they do not capture
the preservation of order or ranking between the corresponding
clickstreams. Therefore, we used Spearman's rank correlation between the
corresponding rows of $(DS, \widetilde{DS})$ and
$(CVS, \widetilde{CVS})$.



Table 1: Average rank correlation between $(DS, \widetilde{DS})$ and
$(CVS, \widetilde{CVS})$ for different sizes ($K$) of the generated
synthetic clickstream set. The synthetic clickstream set is created
using parameter $m$ sampled from the Gaussian distribution
$\mathcal{G}(3, 2)$, parameter $l$ sampled from the Gaussian
distribution $\mathcal{G}(9, 2)$ and parameter $e$ sampled from the
uniform distribution on the clickstream length.

Size           AVG[r(DS, $\widetilde{DS}$)]     STD[r(DS, $\widetilde{DS}$)]
K = 10^        0.5700                           0.3210
K = 5 * 10^    0.8261                           0.3060
K = 10^        0.8914                           0.2224
K = 5 * 10^    0.9308                           0.0639
K = 10^        0.9294                           0.0590

Size           AVG[r(CVS, $\widetilde{CVS}$)]   STD[r(CVS, $\widetilde{CVS}$)]
K = 10^        0.4545                           0.2677
K = 5 * 10^    0.5530                           0.2407
K = 10^        0.6050                           0.2120
K = 5 * 10^    0.7071                           0.1765
K = 10^        0.7361                           0.1784

Because these matrices are sparse, and because only the top-ranked items
are relevant in the recommendation process, we computed the rank
correlation only over the first $z = 100$ most important elements. The
rank correlation between complete rows would be misleadingly high due to
the row sparsity. With the above parameters and $K$ = 10^, we obtained
average rank correlation coefficients
$AVG[r(DS, \widetilde{DS})] = 0.92$ and
$AVG[r(CVS, \widetilde{CVS})] = 0.73$ over all corresponding rows for
the first $z$ most important elements. The rank correlation coefficients
for different values of the parameter $K$ are given in Table 1. They
show that the statistical properties $(DS, \widetilde{DS})$ and
$(CVS, \widetilde{CVS})$ are highly correlated. Although intuitively
clear, it is still important to stress that a high rank correlation
cannot be achieved using arbitrary $DS$ and $CVS$ matrices. When the two
matrices $DS$ and $CVS$ are constructed from the same clickstream set
$C$, we say that they are "aligned". In order to achieve high similarity
for both statistics $(DS, \widetilde{DS})$ and $(CVS, \widetilde{CVS})$,
the random walk biases $DS$ and $CVS$ need to be "aligned". We generated
a synthetic clickstream set $\tilde{C}_{na}$ with $K$ = 10^ from two
matrices $DS_{na}$ and $CVS_{na}$ which are not "aligned" (not
constructed from the same clickstream set $C$). In this case we obtained
low similarity between $(CVS_{na}, \widetilde{CVS}_{na})$ and high
similarity between $(DS_{na}, \widetilde{DS}_{na})$ (see Table 2).
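The row-wise comparison can be sketched as follows. The text does not spell out how the $z$ most important elements are selected; this illustration assumes they are the $z$ largest entries of the real row, and it computes Spearman's coefficient as the Pearson correlation of average ranks. The function names and toy matrices are hypothetical.

```python
import numpy as np

def average_ranks(x):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    for v in np.unique(x):
        tied = x == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Spearman's rank correlation; NaN if either vector is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    if rx.std() == 0 or ry.std() == 0:
        return np.nan
    return float(np.corrcoef(rx, ry)[0, 1])

def top_z_row_correlation(M_real, M_syn, z=100):
    """Mean and standard deviation of the row-wise rank correlation,
    restricted to the z largest entries of each real row (an assumption)."""
    r = []
    for row_real, row_syn in zip(M_real, M_syn):
        top = np.argsort(row_real)[::-1][:z]
        r.append(spearman(row_real[top], row_syn[top]))
    r = np.array(r)
    return float(np.nanmean(r)), float(np.nanstd(r))

# Hypothetical toy matrices: the "synthetic" one preserves the ranking
# of the first row and reverses the ranking of the second row.
M_real = np.array([[5.0, 3.0, 1.0, 0.0], [2.0, 9.0, 4.0, 1.0]])
M_syn = np.array([[50.0, 30.0, 10.0, 0.0], [4.0, 1.0, 2.0, 0.0]])
avg, std = top_z_row_correlation(M_real, M_syn, z=3)
print(avg, std)
```

Restricting the comparison to the top entries avoids the large blocks of tied zeros in sparse rows, which would otherwise dominate the ranks.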

Table 2: Statistical properties of $(CVS_{na}, \widetilde{CVS}_{na})$
and $(DS_{na}, \widetilde{DS}_{na})$ for "non-aligned" matrices and
$K$ = 10^, parameter $m$ sampled from the Gaussian distribution
$\mathcal{G}(3, 2)$, parameter $l$ sampled from the Gaussian
distribution $\mathcal{G}(9, 2)$, parameter $e$ sampled from the uniform
distribution on the clickstream length

                                        AVG r     STD r
$(CVS_{na}, \widetilde{CVS}_{na})$      0.1844    0.1706
$(DS_{na}, \widetilde{DS}_{na})$        0.8921    0.2048



Importance of memory - vertical split experiments

In this experiment we made a "vertical split" of our real clickstream
set $C$ by a cut at time $t$ in the sequence of items and created two
datasets, $C_{train}$ and $C_{test}$, containing the items before and
after the time $t$, respectively. Through this experiment we determine
how different values of memory affect the probability of generating the
relevant item from the $C_{test}$ set. More precisely, we create the
$DS$ and $CVS$ matrices from the whole clickstream set $C$, and for each
clickstream $c^i$ from the $C_{train}$ set and memory $m$ we calculate
the probability of choosing the first next relevant item in the
$C_{test}$ set. For example, if we have the clickstream
$c^i_{train} = \{u_1, u_2, u_3, u_4, u_5, u_6\}$ from $C_{train}$ and
the corresponding clickstream $c^i_{test} = \{u_7, u_8, \ldots\}$ in
$C_{test}$, then the first next relevant item for user $i$ is $u_7$. The
relevant next item probability is defined as the probability of choosing
the item $u_7$ over all other items in the MBRW model using memory $m$.
As expected, we observe that introducing memory into the random walk
model increases the average relevant item probability by approximately
60 percent on this dataset (see Figure 3).
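The relevant next item probability for a single user can be computed directly from the MBRW transition formula. The sketch below assumes the `DS[next, previous]` index convention and returns zero when no item is reachable; the function name and the toy matrices are hypothetical and are not taken from the dataset used in the experiments.

```python
import numpy as np

def relevant_item_probability(DS, CVS, train, target, m):
    """Probability that the MBRW model with memory m picks `target`
    as the item following the training clickstream `train`."""
    w = DS[:, train[-1]].astype(float)
    for u in train[-(m + 1):-1]:     # the m items before the last one
        w = w * CVS[:, u]
    return float(w[target] / w.sum()) if w.sum() > 0 else 0.0

# Hypothetical toy matrices over 3 items.
DS = np.array([[0, 1, 2], [3, 0, 1], [1, 2, 0]])
CVS = np.array([[0, 2, 1], [2, 0, 3], [1, 3, 0]])

train, target = [0, 1], 2            # first item of the test part is item 2
p_m0 = relevant_item_probability(DS, CVS, train, target, m=0)
p_m1 = relevant_item_probability(DS, CVS, train, target, m=1)
print(p_m0, p_m1)
```

In this toy case the probability rises from 2/3 without memory to 1 with one remembered item. Averaging the same quantity over all users for each value of `m` gives a curve of the kind described in the experiment.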

To assess a more realistic situation, that of a real recommender system
in which we assume only partial knowledge of the $C$ dataset, and thus
only partial knowledge of the $DS$ and $CVS$ matrices, we perform
another "vertical split" experiment. We make the same temporal split as
in the previous experiment and create two datasets, $C_{train}$ and
$C_{test}$. Here, we create $DS$ and $CVS$ from the $C_{train}$ dataset
only, in contrast to the previous experiment in which we created them
from the whole $C$ dataset. This means that we have only partial
information when we measure the relevant next item probability. Note
that with partial $DS$ and $CVS$ matrices, the set $Z$ of nodes
neighbouring the last item of a clickstream of $C_{train}$ in the $DS$
graph can be empty. In that case we calculate the relevant item
probability by walking in the $CVS$ graph. From Figure 3 we can see that
introducing memory into the random walk model increases the average
relevant item probability by approximately 42 percent on this dataset
when using partial information about the $DS$ and $CVS$ matrices.

Figure 3: Average relevant item probability with respect to the memory
of the MBRW model, when complete and partial information about the $DS$
and $CVS$ matrices is available

Assessing preference profile preservation 

In this experiment we want to see how well the MBRW model with the
user-profile sampling procedure as input can reproduce the user
preference structure of the original clickstream set. We define the
profile of user $i$ to be any subsequence of the clickstream $c^i$ in
the clickstream set $C$. We created a user-item usage matrix $A$ from
the real clickstream set $C$. Recall that the matrix $A$ contains a
non-zero element $A(i, j)$ if and only if user $i$ consumed item $j$.
Typically, the matrix $A$ is very sparse; in order to compare the
original and synthetic usage matrices, or virtual-user profiles, more
efficiently, we mitigated the sparsity problem by projecting into a
dense space of reduced dimension using matrix factorization. Singular
Value Decomposition (SVD) is a matrix factorization technique that
writes an $m \times n$ matrix $A$ in the form $A = U S V^{*}$. If $A$ is
a real matrix, the SVD is a factorization of the form $A = U S V^T$, in
which $U$ is a unitary matrix of left singular vectors of $A$, $V$ is a
unitary matrix of right singular vectors of $A$, and $S$ is a diagonal
matrix whose non-zero elements are the singular values of $A$. The best
rank-$r$ approximation of the matrix $A$ with respect to the Frobenius
norm is the matrix $A_r = U_r S_r V_r^T$, which keeps the $r$ largest
singular values and the corresponding left and right singular vectors.
This statement is known as the Eckart-Young theorem [17]. The matrix
$U_r$ represents the user preferences in the $r$-dimensional latent
space and the matrix $V_r$ represents the items in the $r$-dimensional
latent space [18].
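The truncated SVD and the Eckart-Young property can be illustrated with NumPy. The usage matrix below is a small random binary matrix generated only for this sketch; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
A = (rng.random((8, 6)) < 0.4).astype(float)   # hypothetical binary usage matrix

# Thin SVD: A = U diag(s) Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 2
U_r = U[:, :r]            # users in the r-dimensional latent space
S_r = np.diag(s[:r])
V_r = Vt[:r, :].T         # items in the r-dimensional latent space
A_r = U_r @ S_r @ V_r.T   # best rank-r approximation in the Frobenius norm

# Eckart-Young: the approximation error is determined by the discarded
# singular values.
err = np.linalg.norm(A - A_r, "fro")
expected = np.sqrt((s[r:] ** 2).sum())
print(err, expected)
```

For matrices of the size used in the experiments a sparse truncated solver would be preferable to a full decomposition; the dense call is used here only to keep the example short.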

To ensure a mapping between the real and the synthetic users, we set the
number of synthetic users equal to the number of real users ($K = n$)
and changed the sampling procedure to sample each real clickstream only
once. Then, we created the synthetic clickstream set $\tilde{C}$ and the
corresponding user-item usage matrix $\tilde{A}$. Afterwards, we
permuted the rows of the matrix $\tilde{A}$ so that its rows represent
the same users as the corresponding rows of the matrix $A$. Note that
when the numbers of artificial and real users are equal, the constraint
that each real clickstream can be sampled only once is a good
approximation, because the user-profile sampling procedure uses the
uniform distribution.

Figure 4: Histogram of cosine similarity between corresponding rows of
the real user preferences $U_r$ and the synthetic user preferences
$\tilde{U}_r$ in the reduced latent feature space, $r = 100$. The
synthetic clickstream set was generated with parameter $m$ sampled from
the Gaussian distribution $\mathcal{G}(3, 2)$, parameter $l$ sampled
from the Gaussian distribution $\mathcal{G}(5, 2)$ and $K = 9000$.

To compare real and synthetic user preferences, we apply a 
fold-in mapping [19] of every user vector u_i from Ã into the 
r-dimensional latent feature space of the matrix A_r. The 
synthetic user vectors in Ã are mapped to Ũ_r by the following 
transformation: 

Ũ_r = Ã V_r S_r^{-1} 

We now have two matrices (U_r, Ũ_r) of dimension n × r in the 
low-dimensional latent space, representing real and synthetic 
user preferences. We compute the cosine similarity between 
the corresponding rows of U_r and Ũ_r. 
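
The fold-in step and the row-wise comparison can be sketched as follows. This is only an illustration under stated assumptions: the toy matrices, the single flipped entry that stands in for a synthetic dataset, and the helper names `fold_in` and `rowwise_cosine` are hypothetical.

```python
import numpy as np

def fold_in(A_new, V_r, S_r):
    """Map user rows of A_new into the latent space: A_new @ V_r @ inv(S_r)."""
    return A_new @ V_r @ np.linalg.inv(S_r)

def rowwise_cosine(X, Y, eps=1e-12):
    """Cosine similarity between corresponding rows of X and Y."""
    num = np.sum(X * Y, axis=1)
    den = np.linalg.norm(X, axis=1) * np.linalg.norm(Y, axis=1)
    return num / np.maximum(den, eps)  # eps guards against all-zero rows

# Hypothetical real usage matrix (4 users x 5 items).
A_real = np.array([[1, 1, 0, 0, 1],
                   [1, 0, 0, 1, 1],
                   [0, 1, 1, 0, 0],
                   [0, 0, 1, 1, 0]], dtype=float)

# Stand-in for a row-aligned synthetic matrix: one extra click for user 0.
A_synth = A_real.copy()
A_synth[0, 3] = 1.0

r = 2
U, s, Vt = np.linalg.svd(A_real, full_matrices=False)
U_r, S_r, V_r = U[:, :r], np.diag(s[:r]), Vt[:r, :].T

U_synth_r = fold_in(A_synth, V_r, S_r)   # synthetic users in the real latent space
sims = rowwise_cosine(U_r, U_synth_r)    # one similarity value per user pair
```

Folding the real matrix itself into the latent space reproduces `U_r`, which is a convenient sanity check for the transformation.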

Figure 4 presents the histogram of cosine similarity between 
corresponding rows of the real user-preference matrix U_r and 
the synthetic user-preference matrix Ũ_r in the reduced latent 
feature space. Cosine similarity ranges from -1, indicating 
dissimilarity, to 1, indicating similarity. Figure 4 shows a 
significant positive offset in similarity, which reflects the 
preservation of user preferences. 

To gain more insight, we varied the parameters m and l over 
a set of deterministic values. To this end, we generated 121 
different synthetic clickstream sets C̃_ml with parameters m 
and l varying in the interval [0, 10]. Figure 5 shows that 
only a small amount of memory is needed for the average 
cosine similarity between the real and the synthetic users to 
become significantly higher. Figure 6 shows that the average 
cosine similarity between the real and the synthetic users 
drops very slowly as the number of random hops increases. 

Figure 5: Average cosine similarity in the reduced latent 
feature space (r = 100) over all corresponding pairs of real 
and synthetic users, with K ≈ 9000. The memory parameter m 
and the number of random hops parameter l vary in the 
interval [0, 10]. 

Figure 6: Average cosine similarity in the reduced latent 
feature space (r = 100) over all corresponding pairs of real 
and synthetic users, with K ≈ 9000. The memory parameter m 
and the number of random hops parameter l vary in the 
interval [0, 10]. 



Recommender system performance experiment 

In this set of experiments we tested the quality of the 
synthetically generated clickstreams by comparing the 
performance of two recommendation models, one based on the 
original set of clickstreams C and one based on the synthetic 
set C̃. Figure 7 illustrates the experimental setup. We used 
a "horizontal split" of our clickstream dataset C into two 
sets, C_train and C_test, of 9000 and 1000 clickstreams, 
respectively. We use the item-based k-nn algorithm [20] as 
the recommender to generate the models for comparison. This 
algorithm takes the user-item usage matrix A_train, 
constructed from the C_train dataset, and generates a model 
M. The model M is an item-to-item similarity matrix reduced 
so that each column i contains only the k most similar 
non-zero elements. Then, we take the C_test clickstream set 
and apply an additional "vertical" (temporal) split into two 
sets, C_query and C_result (see Figure 2, part C). The 
item-based k-nn algorithm produces a top-N recommendation 
list for each clickstream in the C_query dataset, and the 
predictions are evaluated against the ground truth C_result 
using Recall@N, a measure often used in information 
retrieval. Recall@N estimates the probability that a relevant 
item is retrieved in the top-N recommendation list. 

Figure 7: Experimental diagram for testing the recommender 
system performance on the original set of clickstreams C and 
the synthetic set of clickstreams C̃. 
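
The model construction and the Recall@N evaluation described above can be sketched in a few lines. This is a simplified illustration, not the implementation of [20]: cosine similarity between item columns, scoring by summed similarity, the toy matrix and all function names are assumptions.

```python
import numpy as np

def item_knn_model(A, k):
    """Item-to-item cosine similarity matrix, keeping the k largest entries per column."""
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0                 # avoid division by zero for unused items
    sim = (A.T @ A) / np.outer(norms, norms)
    np.fill_diagonal(sim, 0.0)              # an item is not its own neighbour
    M = np.zeros_like(sim)
    for i in range(sim.shape[1]):
        top = np.argsort(sim[:, i])[-k:]    # indices of the k most similar items
        M[top, i] = sim[top, i]
    return M

def recommend_top_n(M, query_items, n):
    """Score items by summed similarity to the query items; return the top-n unseen items."""
    scores = M[:, query_items].sum(axis=1)
    scores[query_items] = -np.inf           # never recommend already visited items
    return [int(i) for i in np.argsort(-scores)[:n]]

def recall_at_n(recommended, relevant):
    """Fraction of the held-out (relevant) items found in the recommendation list."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(recommended)) / len(relevant)

# Hypothetical training matrix: 4 users x 4 items.
A_train = np.array([[1, 1, 0, 0],
                    [1, 1, 0, 0],
                    [0, 0, 1, 1],
                    [0, 1, 1, 1]], dtype=float)

M = item_knn_model(A_train, k=2)
top2 = recommend_top_n(M, query_items=[2], n=2)   # query part of one test clickstream
score = recall_at_n(top2, relevant=[3])           # held-out part of the same clickstream
```

Averaging `recall_at_n` over all clickstreams in the query set gives the average recall reported in the experiments below.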



In the first experiment of this kind we generated the 
synthetic clickstream set C̃_train from C_train with parameter 
m sampled from the Gaussian distribution G(3, 2), parameter l 
sampled from the Gaussian distribution G(5, 2), and K ≈ 9000. 
The item-based k-nn algorithm (k = 15) is then used to build 
two recommender models, M and M̃, using C_train and C̃_train, 
respectively. These models are then used to generate 
predictions for the same query set C_query, and the obtained 
results are evaluated against C_result using the Recall@5 
evaluation measure. Figure 8 gives the results of these two 
models as frequency histograms over particular Recall@5 
values. The high agreement between the histogram profiles 
implies that the original user profiles are well preserved in 
the synthetic clickstream dataset. 

Figure 8: Histograms of recall@5 performance of the RS with 
the real model M and the synthetic model M̃. The average 
recall@5 is 0.2251 for RS(M) and 0.2021 for RS(M̃). 

Figure 9: Average recall@5 for different models M̃_ml on the 
C_query and C_result clickstream datasets, and the 
performance of the real model M (red curve). 



To get a more detailed picture of the performance of 
recommender systems for synthetic datasets generated with 
different deterministic values of the parameters m and l, we 
generated 121 different synthetic clickstream sets C̃_ml with 
parameters m and l varying in the interval [0, 10]. For each 
synthetic clickstream set C̃_ml we built a separate model M̃_ml 
and evaluated its performance as in the previous experiment. 
Figure 9 shows the average recall@5 performance of the 
recommender system that uses M̃_ml on the C_query and 
C_result clickstream datasets. The relatively steep rise in 
performance of the synthetically generated models with m > 3 
and l > 3 shows that it is possible to generate 
recommendation models similar to those generated using the 
original clickstreams. 



Algorithm           Clickstreams            AUC   prec@5  prec@10  prec@15  NDCG  MAP 
User-based k-nn     Real performance        0.93  0.20    0.17     0.15     0.47  0.19 
                    Synthetic performance   0.87  0.17    0.15     0.13     0.43  0.15 
Item-based k-nn     Real performance        0.93  0.26    0.23     0.21     0.52  0.26 
                    Synthetic performance   0.87  0.21    0.18     0.17     0.47  0.20 
BMF-Factorization   Real performance        0.95  0.21    0.15     0.12     0.46  0.19 
                    Synthetic performance   0.92  0.21    0.16     0.13     0.46  0.19 
WRM-Factorization   Real performance        0.91  0.27    0.23     0.21     0.53  0.26 
                    Synthetic performance   0.83  0.19    0.17     0.15     0.46  0.18 

Table 3: Performances of different recommender systems using models generated from original and synthetic 
clickstreams, tested against the same test set C_result. 



The final set of recommender quality experiments involved 
the assessment of synthetically generated recommendation 
models for diverse algorithms [22], [21] (Weighted user-based 
k-nn [26], Weighted item-based k-nn [20], Biased Matrix 
Factorization [23], Weighted Regularized Matrix Factorization 
[25]) using additional measures (AUC, NDCG, MAP, precision). 
The results are given in Table 3 for the synthetic 
clickstreams generated using MBRW with parameter m sampled 
from the Gaussian distribution G(3, 2), parameter l sampled 
from the Gaussian distribution G(5, 2), and K = 9000. 
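
For orientation, the ranking measures named above can be written down for a single user with binary relevance. The sketch below uses common textbook definitions; the exact variants computed by the libraries [21], [22] may differ in details, and the function names and the example ranking are assumptions.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Share of the first k recommended items that are relevant."""
    relevant = set(relevant)
    return sum(1 for item in ranked[:k] if item in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item occurs."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for pos, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / pos
    return total / len(relevant) if relevant else 0.0

def ndcg(ranked, relevant):
    """Discounted cumulative gain divided by the gain of an ideal ranking."""
    relevant = set(relevant)
    dcg = sum(1.0 / math.log2(pos + 1)
              for pos, item in enumerate(ranked, start=1) if item in relevant)
    ideal_hits = min(len(relevant), len(ranked))
    ideal = sum(1.0 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg / ideal if ideal > 0 else 0.0

# Hypothetical ranking for one user; items "a" and "c" are the held-out ones.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c"}

p5 = precision_at_k(ranked, relevant, 5)
ap = average_precision(ranked, relevant)
gain = ndcg(ranked, relevant)
```

MAP is the mean of `average_precision` over all test users; the other measures are averaged over users in the same way.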

7. ANONYMIZATION 

Most anonymization strategies fail to guarantee privacy on 
real datasets. Real-world datasets such as recommendations, 
preferences and transactions tend to be high dimensional and 
mostly sparse. Furthermore, de-anonymization attack 
algorithms can use any available background knowledge along 
with probabilistic reasoning to breach privacy. Narayanan and 
Shmatikov [7] have shown that there are limits to privacy in 
public datasets. They also constructed a formal model for 
privacy breaches in anonymized datasets. We provide only the 
basic notions and definitions from their work [7] to 
demonstrate the anonymization capabilities of the MBRW model. 
We start with the definition of (ε, δ)-sparsity of a 
database. A database D is (ε, δ)-sparse w.r.t. the similarity 
measure sim if: 

P(sim(r, r') > ε, ∀ r' ≠ r) ≤ δ. 

The similarity measure is a function that maps two user 
vectors in the dataset to the interval [0, 1]. 

A database D can be (θ, ω) de-anonymized w.r.t. auxiliary 
information Aux if there exists an algorithm A which, on 
inputs D and Aux(r), where r ∈ D, outputs r' such that: 

P(sim(r, r') ≥ θ) ≥ ω. 

Our MBRW model with the user-sampling procedure generates a 
partially synthetic dataset. Because these clickstreams are 
the result of a discrete stochastic process, privacy is 
preserved in the sense that re-identification is uncertain. 
In addition, the MBRW approach provides a re-identification 
protection guarantee w.r.t. the (θ, ω) de-anonymization 
definition via simple clickstream privacy filtering. 

To illustrate the procedure of clickstream privacy 
filtering, we generated the synthetic clickstream set C̃_train 
from C_train with parameter m sampled from the Gaussian 
distribution G(3, 2) and parameter l sampled from the 
Gaussian distribution G(5, 2). We generated K ≈ 25000 
synthetic clickstreams from the original clickstream set 
C_train, which contains ≈ 9000 real clickstreams. We can then 
simulate a de-anonymization attack by assuming that the 
attacker possesses perfect knowledge of the original dataset 
C_train used to generate our synthetic dataset. In order to 
provide a de-anonymization protection guarantee for our 
synthetic dataset, it is sufficient to calculate the 
similarity between all pairs of clickstreams from C_train and 
C̃_train, and to filter out those clickstreams from C̃_train 
whose similarity to some real clickstream exceeds a threshold 
Θ. To illustrate the proportion of the dataset whose 
similarity to the original clickstreams is above a threshold 
Θ, we calculated all pairwise similarities between the 
original and synthetic clickstreams. The distribution is 
depicted in Figure 10. As the similarity function we used the 
cosine similarity between user vectors. After pruning the 
C̃_train set, we guarantee that no synthetic clickstream has 
similarity to any real clickstream above the threshold Θ, 
that is: 

P(sim(r, r') > Θ) = 0. 

If, for example, we set the threshold Θ to 0.7, then after 
pruning the synthetic clickstream set C̃_train will contain 
approximately 80% of its original size K ≈ 25000. Note that 
we have assumed that the attacker obtained 100% of the real 
dataset and that the similarity between clickstreams is 
calculated as the similarity of binary vectors, where 
ordering is not relevant. In a real-life scenario the 
attacker typically has access only to a small sample of the 
original clickstream set along with the corresponding 
identification attributes. In our experiments we observed 
only a small impact on recommender performance for Θ as low 
as 0.60. 
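
The privacy filtering step can be sketched as follows, assuming clickstreams are represented as binary user-item rows and compared by cosine similarity as described above. The toy matrices and the function name `privacy_filter` are hypothetical.

```python
import numpy as np

def privacy_filter(A_real, A_synth, threshold):
    """Keep synthetic rows whose highest cosine similarity to any real row is <= threshold."""
    def normalize(X):
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        norms[norms == 0] = 1.0              # leave empty clickstreams as zero vectors
        return X / norms
    sims = normalize(A_synth) @ normalize(A_real).T   # shape: (n_synth, n_real)
    max_sim = sims.max(axis=1)               # similarity to the most similar real row
    keep = max_sim <= threshold
    return A_synth[keep], max_sim

# Hypothetical real and synthetic binary usage vectors over 4 items.
A_real = np.array([[1, 1, 0, 0],
                   [0, 0, 1, 1]], dtype=float)
A_synth = np.array([[1, 1, 0, 0],    # identical to a real clickstream
                    [1, 0, 1, 0],    # shares one item with each real clickstream
                    [0, 1, 1, 1]],   # shares two items with the second real one
                   dtype=float)

kept, max_sim = privacy_filter(A_real, A_synth, threshold=0.7)
retained_fraction = len(kept) / len(A_synth)
```

For datasets of the size used here, the dense similarity matrix may be large, so in practice the product could be computed in row blocks or with sparse matrices; that choice does not change the filtering rule.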

8. CONCLUSIONS 

The principal aim of our work was to construct a generator 
of real-like clickstream datasets that preserves the 
structure of the original user-item preferences while at the 
same time addressing privacy protection requirements 
(resilience to "linkage" attacks). With respect to this aim, 
we have investigated the properties of the memory biased 
random walk model with the user-profile sampling procedure. 

Infinitely many sequences or paths can be constructed from 
a clickstream graph, but with the memory biased random walk 
model (MBRW) we sample sequences that are more likely to 
occur in a system described by the DS and CVS statistics. 
Furthermore, we sample clickstreams according to sampled 
user profiles in order to preserve user preferences in the 
system. We demonstrated that the basic statistical properties 
of the "aligned" data generators, the DS and CVS matrices, 
are preserved in the synthetic dataset if we generate a 
dataset of sufficiently large size. We also demonstrated that 
the memory property of the MBRW model increases the 
probability of choosing the relevant next item over all other 
items in the system. 

Figure 10: Fraction of synthetic clickstreams whose 
similarity to the most similar original clickstream is above 
a given value (x-axis: cosine similarity), for the simulated 
dataset. 

In addition to presenting a new algorithm for synthetic 
clickstream generation, we demonstrated that synthetic 
datasets created with the MBRW model can be used to learn 
recommender models that achieve high recommendation 
performance on real data. At the same time, this approach is 
amenable to a simple mechanism for re-identification 
protection in the sense of the de-anonymization guarantee. 

We see an important impact of these results on research: by 
combining high-quality replicas of usage data with item 
content descriptions, more realistic datasets could be made 
available to researchers. With the same approach, commercial 
enterprises that rely on outsourcing in recommender system 
development could lower their privacy-breach risks. 